ABSTRACT

THIS PAPER PROPOSES A DATA MANAGEMENT TECHNIQUE 
WHICH FREES THE LINGUIST'S TIME FROM PURELY MECHANICAL FUNCTION 
BETTER PERFORMED THROUGH PURELY MECHANICAL MEANS, THUS PERMITTING THE 
LINGUIST TO UTILIZE THE SAVINGS IN TIME TO PERFORM THOSE TASKS FOR
WHICH HE IS MOST HIGHLY TRAINED. THE NUMBER OF COMBINATIONS POSSIBLE
IN TERMS OF COLUMNS ON THE DATA PROCESSING CARD AND CHARACTERS ON THE 
KEY-PUNCH MACHINE IS SUFFICIENT TO ACCOMMODATE A GREAT VARIETY OF 
APPROACHES TO ANALYSIS; THE LINGUIST MAY ADAPT THE AVAILABLE CARD 
SPACE AND CHARACTER VARIETY IN ANY WAY HE DEEMS APPROPRIATE. IT IS 
NOT NECESSARY FOR THE LINGUIST HIMSELF TO BE FULLY ACQUAINTED WITH 
THE ACTUAL FUNCTIONING OF THE KEY-PUNCH, THE SORTER, THE PRINTER 
AND/OR REPRODUCER (ALTHOUGH NONE REQUIRES MORE THAN A FEW MINUTES'
INTRODUCTION IN ORDER TO OPERATE IT); A TECHNICIAN MAY BE PROVIDED
WITH A PRECISE OUTLINE OF THE PRINT-OUTS THE LINGUIST WISHES TO SEE 
AND PERFORM ALL OF THE NECESSARY OPERATIONS FOR THE LINGUIST. AN 
EXAMINATION OF REDUCED AND SIMPLIFIED DATA CORRESPONDING TO TWO 
HYPOTHETICAL RESEARCH QUESTIONS SERVES TO ILLUSTRATE THE PROCEDURES 
INVOLVED. THE FIRST IS AN APPLICATION OF THE METHOD TO A STUDY OF 
LEXICAL VARIATION, AND THE SECOND AN APPLICATION TO AN INVESTIGATION 
INTO THE SYNTAX OF BRAZILIAN PORTUGUESE. (AUTHORS/DO) 
Introduction

The unprecedented theoretical advances which linguists claim 
have characterized their discipline in recent years have not been 
matched by appropriate consideration of the very subject matter to 
which such theorizing must eventually refer, i.e., linguistic data. 

It is only axiomatic that observation precedes theorizing although it 
would be to ignore the history of science to interpret "observation" 
in a restricted sense. Introspective "observation" or "hunches" have 
led to fruitful theorizing in many fields, but progress in
theory-making is a function of the verification of the hypotheses which
a particular theory generates. That is, theory construction should not
be equated with building a pyramid of hunches upon hunches. This paper
addresses itself, then, to the question of data handling and
management, for finally, it is data that justifies theorizing.

The Management of Data 

The linguist engaged in the tedious and often error-prone task 
of assembling large quantities of data for purposes of analysis may be 
unaware of the assistance afforded by the peripheral equipment (e.g., 
card sorter, printer) associated with computers; computers per se are
not the subject of this paper. The task of tabulating and coding
linguistic information is often characterized by inaccuracies and delays
which inevitably result from the sheer manual labor involved in
preparing the data for inspection. The present discussion is concerned
with the use of "para-computer" devices as aids in the practical
matters of time and error reduction.

The preparation of 3 x 5 index cards, for example, is a widely
used technique often involving (depending on the particular study)
transcription of responses elicited, glosses, and certain mnemonic or
coding devices which provide the analyst with categorical information
of predetermined types. Recording and coding the same information for
transfer to a data processing card requires the same effort on the part
of the linguist, and the resulting deck of cards is for all practical
purposes equivalent to the familiar 3 x 5 pack. But here the
similarity ends. Once the 3 x 5 card is prepared, the linguist must then
begin the task of shuffling through the cards, assembling various sets
for inspection, and recording the results of such groupings. By
contrast, the completion of a data processing card essentially
represents the completion of the linguist's manual labor. The task of
grouping cards is more efficiently performed by the card sorter: a
large number of relatively error-free combinations may be assembled in
a matter of minutes. The results of sorting may then be rapidly
transposed into an organized format by the printer, resulting in a clean
copy from which the linguist can draw data for analysis. The same
deck of cards may be used repeatedly for different sorts and print-outs
without introducing the error factor characteristic of repeated manual
operations. In addition, identical decks of data cards are easily
obtained through the use of the reproducer in a matter of minutes.

While the intent here is not to offer any highly sophisticated
exploitation of the full potentialities of the computer as such, our
suggestions nevertheless have two very positive aspects: (1) reduction
of the linguist's time investment in error-prone "busy work"
which is performed much faster by a machine anyway; and (2) elimination
of errors: once a data processing card has been prepared and
verified, there can be no further introduction of error because it remains
constant, and the print-outs faithfully reflect the information it
contains. Further, the linguist may reap these benefits without
instruction or preparation in computer science, and without expensive
use of "computer time." It should also be mentioned that the
symbols or characters available on the keyboard of a key-punch
machine are generally similar to those of a standard typewriter, which
linguists have adapted to render their transcriptions irrespective of
the particular language involved.

Specific Examples 

An examination of reduced and simplified data corresponding to
two hypothetical research questions may serve to illustrate the
procedures involved.

A. Application of the method to a study of lexical variation. 

In this example, linguistic characters which are not available
on the keyboard of key-punch machines will be replaced by available
characters as follows:

0 = ) 

^ = / (immediately following stressed syllable
not adhering to a general rule)

s = $ 

(It should be noted, in addition, that all alphabetic characters on
the key-punch machine are "upper case.")

Let us assume a hypothetical dialect study of Brazilian
Portuguese. We are concerned with five Subject (S) variables, namely,
(1) age (defined as a range), (2) ethnic origin, (3) geographical
region in which S resides, (4) sex, and (5) socio-economic status (SES).
Table 1 outlines for each of these variables the different categories
comprehended therein and the numeric codes used for their
identification. For each variable one column of the data card is assigned, and
the particular code in that column identifies the category of interest
for that variable (e.g., Code 2 in Column 2 represents the "Negro"
category of the ethnic variable).

Insert Table 1 About Here 

In this example, the five variables will be assigned specific
columns as follows: Column 1, age; Column 2, ethnic origin; Column 3,
geographic region; Column 4, sex; Column 5, SES; and Columns 6-9
inclusive are set aside for S identification numbers (i.e., 0001-9999).
Let us now consider this column assignment as it relates to Table 1
above. A data card having the numeric characters 322110311 in the
first nine columns would tell us that this S's age falls in the range
"13-19 years old" (Code 3, Column 1), that he is a Negro (Code 2,
Column 2), from the State of Bahia (Code 2, Column 3), a male (Code 1,
Column 4), and of low socio-economic status (Code 1, Column 5), having
0311 for his personal identification number (Columns 6-9).
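
The reading of the first nine columns can be sketched in a few lines of Python. This is only an illustration, not part of the procedure itself: the field names and the function names are assumptions, and the code book holds only the categories named in the paragraph above (the full set would come from Table 1).

```python
# Column positions are 1-indexed and inclusive, as in the text.
SUBJECT_FIELDS = {
    "age": (1, 1),
    "ethnic_origin": (2, 2),
    "region": (3, 3),
    "sex": (4, 4),
    "ses": (5, 5),
    "subject_id": (6, 9),
}

# Partial code book: only the categories mentioned for this one card.
# Everything else in Table 1 is deliberately left out rather than guessed.
CODE_BOOK = {
    "age": {"3": "13-19 years old"},
    "ethnic_origin": {"2": "Negro"},
    "region": {"2": "Bahia"},
    "sex": {"1": "male"},
    "ses": {"1": "low"},
}


def read_columns(card, first, last):
    """Return the characters punched in columns first..last (inclusive)."""
    return card[first - 1:last]


def translate_subject(card):
    """Replace each S code with its label where the code book has one."""
    decoded = {}
    for name, (first, last) in SUBJECT_FIELDS.items():
        code = read_columns(card, first, last)
        decoded[name] = CODE_BOOK.get(name, {}).get(code, code)
    return decoded


subject = translate_subject("322110311")
for name, value in subject.items():
    print(f"{name:14s} {value}")
```

A code with no entry in the code book is passed through unchanged, so the identification number in Columns 6-9 is returned as punched.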

After coding the S variables, there remain seventy-one columns
for the coding of an equivalent number of characters representing
whatever data we may wish to include. In this example we will leave
Column 10 blank and assign to Columns 11-52 the coding in alphabetic
(phonemic) characters of a given response provided by the S. (This
column assignment is, of course, a function of the space requirements
of given data.) We will leave Columns 53-55 blank and assign to
Columns 56-75 the coding of the English equivalents of the responses,
using again alphabetic characters (this time orthographic). Columns
76 and 77 remain blank, and in Columns 78-80 we will code the number
of the item from our instrument (e.g., questionnaire) to which the
particular response coded in the space provided between Columns 11
and 52 is related.
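
As a rough sketch of this card layout, the following Python fragment "punches" an 80-column card image from named fields and reads it back. The layout assumes Columns 53-55 are the blank columns between the response and the gloss; the function names are hypothetical, and the sample card (S 0311 answering item 095 with PEXNILONGO) is invented for illustration rather than taken from the figures.

```python
CARD_WIDTH = 80

# (first_column, last_column), 1-indexed and inclusive.
# Columns 10, 53-55, 76 and 77 are left blank.
LAYOUT = {
    "subject_codes": (1, 5),
    "subject_id": (6, 9),
    "response": (11, 52),
    "gloss": (56, 75),
    "item": (78, 80),
}


def punch_card(fields):
    """Return an 80-character card image with each field left-justified."""
    columns = [" "] * CARD_WIDTH
    for name, value in fields.items():
        first, last = LAYOUT[name]
        width = last - first + 1
        text = value.upper()  # the key-punch offers upper case only
        if len(text) > width:
            raise ValueError(f"{name} needs {len(text)} columns; {width} assigned")
        columns[first - 1:first - 1 + len(text)] = text
    return "".join(columns)


def read_card(card):
    """Split a card image back into its named fields."""
    return {name: card[first - 1:last].rstrip()
            for name, (first, last) in LAYOUT.items()}


# Hypothetical card, for illustration only.
card = punch_card({
    "subject_codes": "32211",
    "subject_id": "0311",
    "response": "PEXNILONGO",
    "gloss": "mosquito",
    "item": "095",
})
print(card)
print(read_card(card))
```

Rejecting a value that is too wide for its field mirrors the remark that the column assignment depends on the space requirements of the data.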

Assuming that we have collected data on 2,000 Ss and coded
them on data processing cards, we will now randomly select 21 of these
data cards to illustrate the data management technique under
discussion. Figure 1 depicts these 21 cards grouped according to
questionnaire item (Columns 78-80).



Insert Figure 1 About Here 



At this point, a brief review is indicated. If Figure 1 and
Table 1 are examined in conjunction, the power of the technique
becomes obvious. Using the codes in Table 1 the reader can examine
any one of the cards listed in Figure 1 and immediately "translate"
the first five columns which specify the characteristics of the S as
described by our five variables. Columns 11-52 contain the S's
response, in alphabetic characters, to the stimulus item depicted in
alphabetic characters in Columns 56-75. The questionnaire item number
of the stimulus appears in Columns 78-80. To wit, the third card that
appears in Figure 1 is that of S 1711 (Columns 6-9) who is over 50
years of age (Code 6, Column 1), of Italian descent (Code 4, Column 2),
living in the geographical region represented by the states of Parana,
Santa Catarina and Rio Grande do Sul (Code 6, Column 3), a female
(Code 2, Column 4), and of Upper-Lower socio-economic status (Code 2,
Column 5). To Item 103 (Columns 78-80), which was the stimulus
"overcoat" (Columns 56-75), this S provided the lexical item "SOBRETUDO"
(Columns 11-19).

Suppose now that a question arises as to whether there exists
an age difference in the responses of those Ss who make up this small
sample. Since the age variable is coded in Column 1 of the cards, the
answer is easily obtained by a three-step procedure: (1) Set the
sorter on Columns 80, 79 and 78, consecutively (performing the sorting
operation each time). After the third run through the sorter the
cards will be grouped as shown in Figure 1. Note that the
questionnaire item numbers are in sequence (from 039 to 103, Columns 78-80).
(2) Set the sorter on Column 1 (the age-code column) and operate it on
each of the questionnaire item groups sorted in Step 1. Using our
data deck, this operation would require four runs through the sorter,
one run for each of the questionnaire items. These four runs result
in the cards being grouped as shown in Figure 2. (3) Using the printer,
print out the result of the sorting operation. Figure 2 was obtained
in this manner. (The "extra" space between item-groups is obtained
by simply inserting a blank card after the last card of each group.)
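
The three-step procedure can be simulated as follows. Each pass through the sorter is modeled as dropping cards into one pocket per symbol and restacking the pockets in order, which keeps cards with the same symbol in their previous relative order; sorting on Columns 80, 79 and 78 in turn therefore leaves the deck in item-number order. The helper names are assumptions. The age codes and S numbers for item 095 and for S 1711 follow the text (with S 1711's item number read as 103); the single item-039 card is invented so that three item groups are present.

```python
from itertools import groupby


def make_card(age_code, subject_id, item):
    """Build an 80-column card image using only the columns needed here."""
    columns = [" "] * 80
    columns[0] = age_code        # Column 1: age code
    columns[5:9] = subject_id    # Columns 6-9: S identification number
    columns[77:80] = item        # Columns 78-80: questionnaire item number
    return "".join(columns)


def sort_on_column(deck, column):
    """One run through the sorter on a single column (a stable pocket sort)."""
    pockets = {}
    for card in deck:
        pockets.setdefault(card[column - 1], []).append(card)
    restacked = []
    for symbol in sorted(pockets):
        restacked.extend(pockets[symbol])
    return restacked


def item_of(card):
    return card[77:80]


deck = [
    make_card("6", "1007", "095"),
    make_card("2", "0281", "095"),
    make_card("6", "1711", "103"),
    make_card("5", "0301", "095"),
    make_card("4", "0191", "095"),
    make_card("3", "0311", "039"),  # invented card, for illustration only
    make_card("2", "0999", "095"),
    make_card("6", "1501", "095"),
    make_card("4", "0023", "095"),
    make_card("5", "1129", "095"),
]

# Step 1: sort on Columns 80, 79 and 78, one run per column.
for column in (80, 79, 78):
    deck = sort_on_column(deck, column)

# Step 2: sort each questionnaire item group on Column 1 (age).
grouped = [(item, sort_on_column(list(cards), 1))
           for item, cards in groupby(deck, key=item_of)]

# Step 3: print, with a blank line standing in for the blank card.
for item, cards in grouped:
    for card in cards:
        print(card)
    print()
```

Sorting on the last digit first is what allows a one-column-at-a-time machine to order a three-digit number in only three runs.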

Insert Figure 2 About Here 

It will be noted that Columns 2 (ethnic origin) and 3 (geo- 
graphical region) for given lexical stimuli provide us with added in- 
formation concerning the responses obtained when these are grouped in 
terms of the age categories. For example, of the ten Ss whose
responses to the stimulus item 095 ("mosquito") are represented in
Figure 2, two each (0281, 0999), (0191, 0023), (0301, 1129), and (1007,
1501) fall within age groups 2, 4, 5, and 6, respectively. Examining
their responses, we observe that the only instance where two Ss
falling within the same age range responded in like manner to this
stimulus item was in the case of the two Ss between 20 and 35 years of age
(Code 4, Column 1) who both responded with "KARAPANAN/." For the
other three instances, separation by age did not provide consistency
of response to this item. It would thus appear (recognizing, of course,
the "illustrative nature" of this conclusion) that age does not
significantly discriminate responses to this item.



Further inspection of the responses to this item (095) reveals
that those Ss providing the response "PEXNILONGO" have either Code 1
(Northeast) or Code 2 (Bahia) in Column 3; those Ss providing the
response "KARAPANAN/" have Code 3 (Amazon Basin) and Code 8 (Goias,
Mato Grosso) in Column 3; and of those Ss providing "MOSKITO," two
have Code 5 (Sao Paulo), one Code 7 (Minas Gerais), one Code 4 (Rio
de Janeiro), and one Code 6 (Parana, Santa Catarina, Rio Grande do
Sul) in this column.

A hypothesis might be proposed based on this example that
geographical region as a variable provides more useful information than
age (or other variables) in classifying S responses. It might be
suggested that the eight geographical regions originally set up and coded
in Column 3 be further reduced to three "dialect areas" within the
boundaries of which responses might generally be expected to be
consistent regardless of ethnic background, age, sex or SES. We could
test the viability of these dialect boundaries by making use of one of
the blank columns on a copy deck of our original deck of data cards.
We would first sort our cards on Column 3 and then group those cards
having in that column Codes 1 and 2, 3 and 8, and 4, 5, 6, and 7,
respectively. These new groups of data cards might then be "gang
punched" (i.e., all of those cards to receive the same symbol are
punched in one pass through the reproducer). The reproducer can be
programmed to punch any symbol in any unused column of the data cards.
In our example problem, Column 54 was arbitrarily chosen. It is clear
at this point that it is to the researcher's advantage to leave certain
data card columns blank in case the need for additional coding arises.
The first group might be punched with "1" in Column 54, the second
with "2," and the third with "3." We are then prepared to test our
dialect boundary hypothesis by sorting our data deck within each of
the groupings of Column 54 (i.e., Codes 1, 2 and 3), according to item
numbers (Columns 78-80). To perform this operation, the sorter would
be set on Column 54 and each of the four item groups would be run
through the machine. With our sample data, four runs would be
required. Figure 3 illustrates how the data would look after being
printed.
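
The gang-punching step and the test of the hypothesized boundaries can be sketched as below. The sketch assumes the region code sits in Column 3 and the new dialect-area code goes into blank Column 54; the function names are hypothetical. The nine cards carry only the region codes and responses reported above for item 095, with all other columns left blank.

```python
from collections import defaultdict

REGION_COLUMN = 3   # geographical region code
AREA_COLUMN = 54    # previously blank column chosen for the dialect area

# Regions 1 and 2 form area 1; 3 and 8 form area 2; 4, 5, 6 and 7 form area 3.
AREA_FOR_REGION = {
    "1": "1", "2": "1",
    "3": "2", "8": "2",
    "4": "3", "5": "3", "6": "3", "7": "3",
}


def make_card(region_code, response):
    """Build an 80-column card image holding a region code and a response."""
    columns = [" "] * 80
    columns[REGION_COLUMN - 1] = region_code
    columns[10:10 + len(response)] = response  # response starts in Column 11
    return "".join(columns)


def gang_punch(card, column, symbol):
    """Punch one symbol into an unused column, refusing to overwrite data."""
    if card[column - 1] != " ":
        raise ValueError(f"Column {column} is already punched")
    return card[:column - 1] + symbol + card[column:]


def punch_dialect_areas(deck):
    return [gang_punch(card, AREA_COLUMN, AREA_FOR_REGION[card[REGION_COLUMN - 1]])
            for card in deck]


def responses_by_area(deck):
    """Collect the distinct responses (Columns 11-52) found in each area."""
    found = defaultdict(set)
    for card in deck:
        found[card[AREA_COLUMN - 1]].add(card[10:52].rstrip())
    return dict(found)


deck = [
    make_card("1", "PEXNILONGO"), make_card("2", "PEXNILONGO"),
    make_card("3", "KARAPANAN/"), make_card("8", "KARAPANAN/"),
    make_card("5", "MOSKITO"), make_card("5", "MOSKITO"),
    make_card("7", "MOSKITO"), make_card("4", "MOSKITO"),
    make_card("6", "MOSKITO"),
]

areas = responses_by_area(punch_dialect_areas(deck))
for area in sorted(areas):
    print(area, sorted(areas[area]))
```

For this one item each area yields a single response, which is the kind of agreement the hypothesis predicts; with more items the same tabulation would show where the proposed boundaries fail.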

Insert Figure 3 About Here

If we find general agreement for the responses to the items 
within the three separate areas we have determined (and coded in 
Column 54), we may consider the regional divisions viable ones. If,
on the other hand, great diversity is found within a given dialect 
area (for responses to various items), we might generate other group- 
ings which may more faithfully reflect geographical variations in the 
use of the lexical items under investigation. We may ultimately
reject the viability of geographical region alone as a significant
predictor of responses to the items in question, concluding that a
combination of factors (variables) is necessary before specification
is possible.

It should be noted in passing that the more specific our
coding system is at the beginning of our analysis, the greater the number
of hypotheses we may generate and test by the procedure just described.
That is, if we had, in our initial coding, provided only three codes
for geographical regions, we could not have subsequently decided to
invoke the eight regions illustrated in Column 3 of Table 1 since they
would not have been available.

B. Application of the method to an investigation into the syntax of 
Brazilian Portuguese. 

In this example, a roughly morphemic transcription will be
employed (cf. Slobin, 1967, pp. 213-214). Aspects of phonology
which are deemed particularly relevant to the research question might
have been recorded in phonemic or phonetic transcription using slashes
(/) or some other device for identification of such departures from
the general approach; for this example, however, all responses are
assumed "to reflect standard pronunciation" (Slobin, 1967, p. 214),
and the transcription used here will be limited to a morphemic one.

Let us assume now a hypothetical syntactic analysis of the 
negative in Brazilian Portuguese based on a series of research hypoth- 
eses expressed in questionnaire format. The principal variables in 
this case are not used to describe the S, as in Example A, but rather, 
reflect each response provided by the S. The variables of interest 
are shown in Table 2. 

Insert Table 2 About Here

So, for example, a data card having the numeric characters 12233221 in
Columns 1-8 tells us that the string for analysis is of syntactic
interest (Code 1, Column 1), a distinction needed in the case that
analyses of other aspects of the grammar had been, were presently, or
would be, performed; that
the sentence (or part thereof) in question is, in addition to being
negative, also interrogative (Code 2, Column 2); that the sentence
contains a direct object which is a noun (Code 2, Column 3); that the
sentence contains both an adverb of place and an adverb of time
(Code 3, Column 4); that the person of the main verb of the sentence
is third person singular (Code 3, Column 5); that the verb is in the
past (Code 2, Column 6), perfect (Code 2, Column 7), indicative
(Code 1, Column 8). Further, if in Columns 76-80 the card reveals
the code 02547, the information supplied includes: the sentence
contains a simple "not" (Code 0, Column 76); the sentence was supplied by
Subject 2 (Code 2, Column 77), whose defining characteristics would
have already been noted elsewhere if considered relevant to the
analysis in question; and that the item number is 547 (5, 4, 7 in Columns 78,
79 and 80, respectively), permitting immediate location of a specific
sentence stimulus.
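
The same column-slicing idea carries over to this second layout. In the sketch below the field names are assumptions chosen to match the description above (Table 2 is not reproduced here), and the first eight columns are read as 12233221, i.e., the codes listed above taken one per column. Only the raw codes are returned; no category labels are supplied.

```python
# (first_column, last_column), 1-indexed and inclusive.
SYNTAX_LAYOUT = {
    "level_of_analysis": (1, 1),
    "sentence_type": (2, 2),
    "direct_object": (3, 3),
    "adverbs": (4, 4),
    "person": (5, 5),
    "tense": (6, 6),
    "aspect": (7, 7),
    "mood": (8, 8),
    "negative_type": (76, 76),
    "subject": (77, 77),
    "item": (78, 80),
}


def split_fields(card, layout):
    """Return the code punched in each named field of the layout."""
    return {name: card[first - 1:last] for name, (first, last) in layout.items()}


# Columns 1-8 and 76-80 as described; Columns 9-75 left blank here.
card = "12233221" + " " * 67 + "02547"
codes = split_fields(card, SYNTAX_LAYOUT)
for name, code in codes.items():
    print(f"{name:18s} {code}")
```

Because the layout is passed in as data, a different study needs only a different table of column assignments, not a different routine.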

Let us assume that the research question regarding negative
constructions in Portuguese has been developed to the point of
eliciting certain illustrative or representative sentences from a native
speaker (who, of course, may be the linguist himself). These
sentences are then coded and prepared for the data processing card. A
sample of 14 cards in a print-out is presented in Figure 4.

Insert Figure 4 About Here 



Suppose now that the issue in question is the effect, if any,
of a direct object upon the order of elements in a negative statement.
Since the Direct Object variable is in Column 3 of our cards,
relevant information is readily obtainable by following the procedure
described above for the age variable in Example A (i.e., sorting and
printing). The resulting print-outs provide the data grouped
according to the presence or absence of a direct object, and in the grouping
indicating the presence of a direct object, the various sub-categories
(e.g., noun, pronoun, or indefinite pronoun direct objects) are
isolated.

In the same manner other variables may be isolated and pre- 
pared for inspection by appropriate sorting and the resulting print- 
outs. 

It will be noted in the above example that several columns
have been left blank (e.g., Columns 9-13). It will also be noted that
the example here presented is quite simplified. Further specification
of linguistic variables of relevance is provided for by these blank
columns. The analysis might begin as outlined here. As it proceeds,
the linguist may wish to add further relevant information in the blank
columns, as was exemplified by the inclusion of a hypothesized dialect
boundary in Column 54 of Figure 3. There is, in short, a flexibility
which permits the linguist to begin by testing one set of hypotheses,
to add others as he progresses, and to rest assured that quality
control is, in fact, the forte of the technique.

To question or quarrel with the inclusion or exclusion of a
specific linguistic variable in the examples set forth here is not
really relevant to the issue of using computer-related peripheral
equipment; it is, rather, a commentary on the linguistic analysis
itself. The latter is within the domain and responsibility of each
individual linguist to determine. The more sophisticated the linguist's
analysis, the more sophisticated his coding system might be. The
above examples merely attempt to provide hypothetical analyses which
serve to illustrate the proposed data management technique.

Summary 

What is proposed is a data management technique which frees
the linguist's time from purely mechanical functions better performed
through purely mechanical means, thus permitting the linguist to
utilize the savings in time to perform those tasks for which he is
most highly trained. The number of combinations possible in terms of
columns on the data processing card and characters on the key-punch
machine is sufficient to accommodate a great variety of approaches to
analysis; the linguist may adapt the available card space and
character variety in any way he deems appropriate. It is not necessary for
the linguist himself to be fully acquainted with the actual
functioning of the key-punch, the sorter, the printer and/or reproducer
(although none requires more than a few minutes' introduction in order to
operate it); a technician may be provided with a precise outline of
the print-outs the linguist wishes to see and perform all of the
necessary operations for the linguist.

It should be noted that the tasks of analysis, decision making,
and the like are in no way eliminated or even lessened through the use
of this technique; the linguist obviously cannot simply toss
unorganized data into the equipment and expect an analysis to emerge. If
anything, the use of peripheral data processing equipment may force
the linguist to greater precision and consistency in his preparation
of data for analysis. What must be emphasized, however, is that this
proposal frees the linguist from the drudgery which too often
accompanies linguistic analysis (and may result in reduced efficiency in
the analysis itself) and enables him to devote more time and energy
to his principal task: the analysis of the data.
Footnotes



1. The authors gratefully acknowledge the assistance provided by
Dr. Donald J. Veldman in making available the data-processing
facilities of the Research and Development Center for Teacher
Education of the University of Texas at Austin in the validation
stages of the technique herein described.

2. The examples provided represent procedures actually carried out;
the print-outs which appear as figures are the result of the
operations performed in the validation of the procedures described.

3. For a more comprehensive discussion of card punching, reproducing
and sorting techniques, see Veldman, Donald J., Fortran Programming
for the Behavioral Sciences, New York: Holt, Rinehart and
Winston, 1967.

4. The apparent simplicity depicted in Figure 2 and in the other
examples is due to the small sample represented (i.e., 21 data
cards). With a larger sample, the power of the technique becomes
all the more obvious.

5. Slobin, Dan I. (Ed.) A Field Manual for Cross-cultural Study of
the Acquisition of Communicative Competence, University of
California, Berkeley, 1967 (second draft).